# Prepare Leaf margin type data from TRY for use

The leaf margin type data from TRY mixes three traits: the actual leaf margin, leaf compoundness, and overall leaf shape. The pre-processing script separates them.

*If you intend to clean more than one or two traits, we recommend the use of the batch pre-processing script. Refer to the [TRY main page](try-label) for details.*

If you have questions or suggestions, spot errors, or want to contribute, get in touch with us at planthub@idiv.de.

Author: David Schellenberger Costa

## Requirements

To run the script, the following is needed:
- TRY data, available <a href="https://planthub.idiv.de/downloads/" target="_parent">here</a>
- the data.table package, which may need to be installed first

## Code

In [None]:
# load in libraries
library(data.table) # handle large datasets

# clear workspace
rm(list = ls())


Let's get the TRY data.

In [None]:
# set working directory (adapt this! `.brd` is assumed to be a user-defined base path)
setwd(paste0(.brd, "PlantHub"))

# read in data (adapt this!)
TRY <- fread("TRY_PlantHub.gz")

# select data of interest
TRYSubset <- TRY[TraitName == "Leaf margin type"]


To get an overview of the data, we convert the values to lowercase, tabulate them, and show the table sorted by frequency.

In [None]:
# extract original data strings
oriVals <- TRYSubset$OrigValueStr # oriVals == original values

# change all to lowercase to ease later classification
oriVals <- tolower(oriVals)

# get an overview of the data by tabulating values and showing them ordered by frequency (ascending)
valueOverview <- table(oriVals)
valueOverview[order(valueOverview)]
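
The following is a minimal sketch of the overview step on made-up values. The vector `toyVals` is hypothetical and only stands in for `TRYSubset$OrigValueStr`. It shows that `table()` returns its entries in alphabetical order, while indexing with `order()` on the counts sorts them by frequency.

```r
# hypothetical raw entries standing in for TRYSubset$OrigValueStr
toyVals <- c("Entire", "entire", "Serrate", "ENTIRE", "serrate", "lobed")

# lowercase so that spelling variants collapse into one category
toyVals <- tolower(toyVals)

# table() counts each distinct value; names come out alphabetically
toyOverview <- table(toyVals)
names(toyOverview)

# indexing with order() on the counts sorts by frequency (rarest first)
toyByFrequency <- toyOverview[order(toyOverview)]
toyByFrequency
```

Sorting by frequency puts rare spellings first, which may help to spot typos and unusual categories before defining search strings.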


Some entries are coded as numbers whose meaning is documented in the original datasets. Others are coded as "yes", with the actual margin type given in the original trait name. We decode both, discard "no" entries, and remove any remaining purely numeric values.

In [None]:
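# replace NAs by empty strings to keep the logical indexing below free of NAs
oriVals[is.na(oriVals)] <- ""

# inspect the original trait names of entries coded as "other"
table(TRYSubset[oriVals == "other"]$OriglName)

# decode numerically coded entries (codes are dataset-specific)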
oriVals[TRYSubset$DatasetID == 478 & TRYSubset$OrigValueStr == 0] <- "entire"
oriVals[TRYSubset$DatasetID == 478 & TRYSubset$OrigValueStr == 1] <- "undulate"
oriVals[TRYSubset$DatasetID == 478 & TRYSubset$OrigValueStr == 2] <- "sinuate"
oriVals[TRYSubset$DatasetID == 731 & TRYSubset$OrigValueStr == 0] <- "entire"
oriVals[TRYSubset$DatasetID == 733 & TRYSubset$OrigValueStr == 0] <- "entire"
oriVals[TRYSubset$DatasetID == 731 & TRYSubset$OrigValueStr == 1] <- "toothed"
oriVals[TRYSubset$DatasetID == 733 & TRYSubset$OrigValueStr == 1] <- "toothed"
oriVals[oriVals == "yes"] <- sub("Leaf margin:\\s*", "", TRYSubset[oriVals == "yes"]$OriglName)
oriVals[oriVals == "no"] <- NA # negation not helpful

# remove purely numeric values and others that have no lowercase character included
oriVals[!grepl("[[:lower:]]", oriVals)] <- NA
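
As an illustration of the decoding logic, here is a small sketch on invented data. All `toy*` objects are hypothetical. The dataset ID 478 and its codes are borrowed from the cell above, while 999 is a made-up ID.

```r
# hypothetical mini dataset: dataset IDs, raw values, and original trait names
toyID   <- c(478, 478, 478, 999, 999, 999)
toyRaw  <- c("0", "1", "2", "17", "Entire", "yes")
toyName <- c("margin", "margin", "margin", "margin", "margin", "Leaf margin: serrate")

toyVals <- tolower(toyRaw)

# decode the numeric codes of one dataset
toyVals[toyID == 478 & toyRaw == 0] <- "entire"
toyVals[toyID == 478 & toyRaw == 1] <- "undulate"
toyVals[toyID == 478 & toyRaw == 2] <- "sinuate"

# for "yes" entries, the category is taken from the original trait name
isYes <- toyVals == "yes"
toyVals[isYes] <- sub("Leaf margin:\\s*", "", toyName[isYes])

# values without any lowercase letter (here "17") carry no usable information
toyVals[!grepl("[[:lower:]]", toyVals)] <- NA
toyVals
```

Note that the numeric codes are decoded per dataset, because the same number may mean different things in different datasets.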


The most important part of the cleaning process is defining the search strings.
In some cases, we use regular expressions to make a pattern more inclusive (or exclusive).

In [None]:
# create a vector containing the search strings to look for
searchNames <- c(
	# leaf margin
	"^entire",
	"toothed|dentate|denticulate",
	"serr(ul)?ate|runcinate",
	"crenate|crenulate",
	"sinuate",
	"spiny",
	"ciliate",
	# leaf compoundness
	"lobed|lobate",
	"dissected|pinnate|pinnatifie?d|incised",
	# leaf shape
	"revolute",
	"undulate"
)
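
To see how some of these patterns behave, the sketch below applies them to a few invented strings (`toyVals` is hypothetical). The anchor `^` restricts a match to the start of the string, and `?` makes the preceding element optional.

```r
# hypothetical values to test the patterns against
toyVals <- c("entire", "not entire", "serrate", "serrulate",
	"pinnatifid", "pinnatified", "bipinnate")

# "^entire" only matches at the start, so "not entire" is excluded
matchEntire <- grepl("^entire", toyVals)

# "(ul)?" is optional, so both "serrate" and "serrulate" match
matchSerrate <- grepl("serr(ul)?ate", toyVals)

# "e?" is optional, so both "pinnatifid" and "pinnatified" match
matchPinnatifid <- grepl("pinnatifie?d", toyVals)

# without an anchor, "pinnate" also matches inside "bipinnate"
matchPinnate <- grepl("pinnate", toyVals)

toyVals[matchEntire]
toyVals[matchSerrate]
toyVals[matchPinnatifid]
toyVals[matchPinnate]
```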


We can now search for the strings defined before and give names to the new categories, separated by the traits they will belong to. We also prepare a matrix to save new values in.

In [None]:
# search for the strings defined before
searchResults <- sapply(searchNames, grepl, oriVals, ignore.case = TRUE)

# name the columns of the searchResults matrix after the cleaned category names,
# grouped by trait (1 = leaf margin, 2 = leaf compoundness, 3 = leaf shape)
searchResultsCols <- list()
searchResultsCols[[1]] <- c("entire", "dentate", "serrate", "crenate", "sinuate", "spiculate", "ciliate")
searchResultsCols[[2]] <- c("lobate", "compound,pinnate")
searchResultsCols[[3]] <- c("revolute", "undulate")
colnames(searchResults) <- unlist(searchResultsCols)

# prepare matrix to save new values in
newVals <- matrix(NA, length(oriVals), length(searchResultsCols))
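
The search step can be sketched on a reduced example. The objects `toyNames`, `toyVals`, and `toyResults` are hypothetical. `sapply()` calls `grepl()` once per pattern and binds the resulting logical vectors into a matrix with one row per entry and one column per pattern.

```r
# hypothetical, shortened list of search strings and values
toyNames <- c("^entire", "toothed|dentate", "lobed|lobate")
toyVals <- c("entire", "dentate, lobed", "Toothed", NA)

# one grepl() call per pattern; each call returns one logical column
toyResults <- sapply(toyNames, grepl, toyVals, ignore.case = TRUE)

# replace the pattern strings used as column names by readable categories
colnames(toyResults) <- c("entire", "dentate", "lobate")
toyResults
```

An entry can match several patterns at once (second row), and `NA` entries match none, because `grepl()` returns `FALSE` for missing values.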


Let's have a look at the results.

In [None]:
# show the number of matches to each category
colSums(searchResults)

# show the original entries for which no match was retrieved
sort(table(oriVals[rowSums(searchResults) < 1]))

# show the number of entries that weren't matched to any category
sum(rowSums(searchResults) < 1)

# show the number of entries that were matched to more than one category
sum(rowSums(searchResults) > 1)

# check which entries were classified into > 1 groups
table(oriVals[rowSums(searchResults) > 1])


Within each trait, the categories are treated as mutually exclusive, so we discard entries that matched more than one category of the same trait.

In [None]:
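# remove contradictory entries: within each trait, only one category is possible
for (i in seq_along(searchResultsCols)) {
	# columns belonging to the current trait
	inGroup <- colnames(searchResults) %in% searchResultsCols[[i]]
	# drop = FALSE keeps the matrix structure even for a single column
	searchResults[rowSums(searchResults[, inGroup, drop = FALSE]) > 1, inGroup] <- FALSE
}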
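
The effect of this step can be checked on a small, invented match matrix (`toyResults` and `toyCols` are hypothetical). Matches in different traits are kept, while multiple matches within one trait are reset to `FALSE`.

```r
# hypothetical match matrix: rows are entries, columns are categories
toyResults <- matrix(
	c(TRUE,  TRUE,  FALSE,   # matches two margin categories: ambiguous
	  TRUE,  FALSE, TRUE,    # one margin, one compoundness category: fine
	  FALSE, FALSE, TRUE),   # one compoundness category: fine
	nrow = 3, byrow = TRUE,
	dimnames = list(NULL, c("entire", "dentate", "lobate"))
)

# hypothetical grouping of categories by trait
toyCols <- list(c("entire", "dentate"), "lobate")

for (i in seq_along(toyCols)) {
	inGroup <- colnames(toyResults) %in% toyCols[[i]]
	ambiguous <- rowSums(toyResults[, inGroup, drop = FALSE]) > 1
	toyResults[ambiguous, inGroup] <- FALSE
}
toyResults
```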


We can now use the cleaned results data to create a new data matrix, with one column for each trait.

In [None]:
# use the searchResults matrix to create new value strings by concatenating all data found
for (i in seq_along(searchResultsCols)) {
	searchResultsTemp <- searchResults[, colnames(searchResults) %in% searchResultsCols[[i]], drop = FALSE]
	newVals[, i] <- sapply(seq_len(nrow(searchResultsTemp)), function(x) {
		paste(searchResultsCols[[i]][searchResultsTemp[x, ]], collapse = ",")
	})
}
newVals[newVals == ""] <- NA
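
A reduced sketch of this conversion, again with hypothetical `toy*` objects, shows how each row of the logical matrix becomes one string per trait, and how rows without a match end up as `NA`.

```r
# hypothetical, already cleaned match matrix
toyResults <- matrix(
	c(TRUE,  FALSE, FALSE,
	  FALSE, FALSE, TRUE,
	  FALSE, FALSE, FALSE),
	nrow = 3, byrow = TRUE,
	dimnames = list(NULL, c("entire", "dentate", "lobate"))
)
toyCols <- list(c("entire", "dentate"), "lobate")

# one output column per trait
toyNew <- matrix(NA, nrow(toyResults), length(toyCols))

for (i in seq_along(toyCols)) {
	toyTemp <- toyResults[, colnames(toyResults) %in% toyCols[[i]], drop = FALSE]
	# select the category names flagged TRUE in each row and paste them together
	toyNew[, i] <- sapply(seq_len(nrow(toyTemp)), function(x) {
		paste(toyCols[[i]][toyTemp[x, ]], collapse = ",")
	})
}

# rows without any match yield empty strings, which become NA
toyNew[toyNew == ""] <- NA
toyNew
```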


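We first transfer the data belonging to the other two traits, leaf compoundness and leaf shape. Each pass relabels the "Leaf margin type" rows, so from the second pass on we first append a fresh copy of the subset.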

In [None]:
# move values to other traits
traitNames <- c("gotoLeaf compoundness", "gotoLeaf shape")
for (i in seq_along(traitNames)) {
	if (i > 1) TRY <- rbind(TRY, TRYSubset, fill = TRUE)
	TRY[TraitName == "Leaf margin type", CleanedValueStr := newVals[, i + 1]]
	TRY[TraitName == "Leaf margin type", TraitName := traitNames[i]]
}


Now we transfer the cleaned leaf margin data.

In [None]:
# integrate into TRY
TRY <- rbind(TRY, TRYSubset, fill = TRUE)
TRY[TraitName == "Leaf margin type", CleanedValueStr := newVals[, 1]]
TRY[TraitName == "Leaf margin type", TraitName := "Leaf margin"]


Because we duplicated the rows to hold the values of the other traits, we now remove the duplicated rows
that have no value in the "CleanedValueStr" column. This avoids an unnecessary increase in file size.

In [None]:
TRY <- TRY[!grepl("^goto", TraitName) | !is.na(CleanedValueStr)]


We labeled some data with an existing trait name prefixed by "goto". The prefix marks data that will
eventually be moved to the respective trait, while avoiding another round of pre-processing.
Therefore, only run the following line if this is the last of the pre-processing scripts you want to use.

In [None]:
TRY[, TraitName := sub("^goto", "", TraitName)]
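
The last two steps can be illustrated with plain vectors instead of the TRY table. The vectors `toyTrait` and `toyClean` are hypothetical stand-ins for the `TraitName` and `CleanedValueStr` columns.

```r
# hypothetical stand-ins for the TraitName and CleanedValueStr columns
toyTrait <- c("Leaf margin", "gotoLeaf compoundness", "gotoLeaf compoundness", "gotoLeaf shape")
toyClean <- c(NA, "lobate", NA, "undulate")

# keep rows that are not "goto" rows, or that carry a cleaned value
keep <- !grepl("^goto", toyTrait) | !is.na(toyClean)
toyTrait <- toyTrait[keep]
toyClean <- toyClean[keep]

# strip the prefix so the rows join their target traits
toyTraitFinal <- sub("^goto", "", toyTrait)
toyTraitFinal
```

Only the empty "goto" row is dropped. The "Leaf margin" row without a cleaned value is kept, as it is part of the original data.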


Let's write the data to a file.

In [None]:
# write data
fwrite(TRY, file = paste0("TRY_processed_", Sys.Date(), ".gz"))
